Abstract


This report summarizes the results of an extensive research program on the real-time
implementation of multidimensional (M-D) digital signal processing algorithms. We began
by studying the efficient implementation of M-D digital filters. We mapped the M-D digital
filter to a state space model because the state space model supports local data
communications. We studied various approaches to implementing the state space model for M-D
digital signal processing applications and found that the best approach maps the state
space model onto a generalized linear finite state machine, which facilitates the
hardware implementation. Using this approach, we developed a multiprocessor system
architecture that is scalable, modular, and highly efficient. Based upon these results,
we developed the architecture for an application specific computing system which we call
a Block Data Flow Architecture (BDFA). We are currently studying the mapping of several
other M-D signal processing algorithms and matrix operations to the BDFA. These studies
show that multiprocessor systems using the BDFA can achieve high throughput and high
efficiency at a modest cost.


Chapter  1 


Introduction. 


1.1  Statement  of  the  Problem. 


Extensive research and development have been devoted to multidimensional (M-D) digital
signal processing [1]. Recently, there has been a dramatic increase in the performance
of computer systems. Thus, it has become more practical to implement many M-D digital
signal processing applications in real time. Practical applications of M-D digital
signal processing include remote sensing, industrial inspection, robot vision, data
compression for communications, processing of biomedical images for diagnosis, character
recognition, fingerprint recognition, and weather forecasting. In general, these
applications are computationally intensive and require substantial data communications.

In many cases, the reduction of computer hardware cost makes it practical to design
special purpose computer systems tailored to the specific requirements of a given class of
algorithms. Systems with the computational capability to handle real-time or near
real-time M-D digital signal processing are just becoming available as a result of these
efforts. However, most M-D digital signal processing tasks are too complicated to implement
in real time using a single processor system. Thus, the development of M-D digital signal
processing algorithms specifically designed for multiprocessor systems is an important
research area.

In this research program, we have concentrated on the development of algorithms
which can be effectively used for high speed M-D digital signal processing in a
multiprocessor or multicomputer environment. The traditional approach to research on the
development of efficient algorithms for digital signal processing is to reduce the total
number of multiplications (or complex multiplications) required. However, this criterion
alone is not adequate for algorithms to be implemented on a state-of-the-art multiprocessor
system. For example, the transfer of a data word between chips in a multiple chip system
(typically on the order of 30 to 100 nanoseconds) can require as much time as a
multiplication. Thus, data communications requirements should be given at least equal
consideration to computational complexity in developing algorithms for multiprocessor
systems.

We began our research program on the real-time implementation of two-dimensional
(2-D) digital filters. We later generalized our results to include all discrete, linear,
shift-invariant (DLSI) M-D systems. A DLSI system is a discrete system for which the system
parameters do not vary with changes in the independent variables (time, space, distance,
range, etc.). Thus, the coefficients for the finite difference equation representation of a
DLSI system are constants. A finite difference equation expresses the result of a
computation as a weighted sum of current and previous inputs and past outputs. Quite often
the independent and dependent variables are parameters such as time, space, range,
temperature, etc. Many practical digital signal processing and digital control problems can
be represented as DLSI systems. In addition, many shift-variant systems can be approximated
over small intervals as DLSI systems. Our approach has been to design computationally
efficient algorithms which are optimized for implementing M-D DLSI systems in a
multiprocessor environment. In this way, our results can be applied to a large variety of
problems.

Real-time M-D digital signal processing has a wide range of applications such as
radar  and  sonar  signal  processing,  biomedical  diagnosis,  photography,  broadcast  television, 
computer  vision,  and  seismology.  Computational  requirements  of  signal  processing  tasks 
such  as  beam-forming,  adaptive  filtering,  data  compression  and  parameter  estimation  can 
be  reduced  to  a  common  set  of  matrix  operations [2].  Matrix  operations  also  find  important 
applications  in  many  areas  such  as  oceanography,  weather  prediction,  dynamic  quantum 
field  theory,  aerodynamics,  petroleum  exploration,  astrophysics,  fluid  mechanics,  geophysics 
and  particle  physics.  These  applications  require  a  system  with  high  throughput  and  high 
efficiency  for  real-time  implementation. 

Normally, signal processing tasks and matrix operations possess a large amount of
inherent parallelism. Many parallel algorithms and parallel structures have been developed
to exploit this parallelism [3], [4]. However, most parallel algorithms have been optimized
for implementation on general purpose computers. General purpose computers cannot achieve
the high system throughput required for real-time processing because of limitations due to
system management and control overhead and data communication problems. Data
communications requirements are very important in developing multiprocessor implementations
of these algorithms.

Most parallel multiprocessor systems, such as systolic arrays and hypercube
multiprocessor systems, have a synchronous SIMD structure. A synchronous system achieves its
parallelism by synchronous clock-step operations [5]. This implies that all operands have to
be ready before any processor can start its designated operation. This strictly synchronous
operation imposes a severe timing restriction on the system design and causes
implementation difficulties such as the clock skew problem for large scale systems. Thus, the
throughput rate of most multiprocessor systems fails to increase in proportion to the
number of processors.

A wavefront array replaces the requirement for correct timing with a requirement for
correct sequencing, which overcomes the globally synchronous timing problem [6]. However, if the
handshaking  for  the  wavefront  array  is  done  at  the  word  level,  then  the  resources  required 
to  implement  the  handshaking  protocol  limit  the  overall  system  efficiency  and  throughput. 
The  BDFA  is  essentially  a  wavefront  array  with  a  block  data  handshaking  protocol.  Thus, 
the  BDFA  has  the  asynchronous  timing  advantages  of  the  wavefront  array  but  it  can  still 
have  a  very  high  efficiency  because  of  the  reduction  of  overhead  due  to  handshaking  for  data 
communications. 

Algorithms designed for systolic arrays and wavefront arrays use an algorithm
partitioning strategy. In an algorithm partitioning strategy, each processor implements a
different part of the algorithm and the total problem is solved using a pipeline. The use of
this strategy may lead to unnecessary data movement among processing elements because only
the edge processors have access to input or output devices. Results must pass from
processor to processor until they reach one that interfaces to the output device. This
unnecessary data movement may increase system management and data communication overhead,
and may increase hardware complexity. It also increases the data dependency among the
processing elements.

While it is possible to obtain impressive performance with bus-organized
multiprocessor systems and multiprocessor array systems for individual algorithms, the
performance typically falls off due to data communication problems and/or synchronization
problems as the number of processors is increased. In addition, hardware especially designed
for a given algorithm either cannot be adapted to solve other problems or the performance is
drastically reduced on other problems. We have attempted to develop an application specific
architecture which can solve a class of problems with high throughput and high efficiency.
We expect this approach to result in a cost effective solution to demanding M-D digital
signal processing problems.


1.2  Application  Specific  Computing  Systems 


Although the primary goal of our research program is the development of algorithms and
computational structures for high performance M-D digital signal processing applications, a
secondary goal of our research program is the development of application specific computing
systems for digital signal processing with emphasis on real-time applications. We are
especially interested in computationally intensive M-D digital signal processing
applications such as beam-forming, M-D digital filtering, discrete transforms, adaptive
filters, etc. Since many of these applications can be formulated as matrix operations, we
include matrix operations in the desired family of algorithms.

We developed the BDFA to have the flexibility to efficiently solve a variety of problems
in this class of algorithms while still providing the high throughput for real-time
applications. By exploiting the regularity and inherent parallelism in these applications,
we found that many M-D signal processing and matrix algorithms can be solved using a data
partitioning strategy. In a data partitioning strategy, each processor receives a different
portion of the data and attempts to complete all of the necessary computations for its
assigned data partition. We eliminate data communications to other processors when possible
and minimize them when they are necessary.
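As a minimal sketch of the data partitioning idea, the following Python fragment splits the rows of a 2-D data set into contiguous blocks, one block per processor. The function name `partition_rows` and the choice of rows as the partitioning unit are assumptions made for illustration; the report does not prescribe a particular block shape here.

```python
def partition_rows(num_rows, num_processors):
    """Split row indices 0..num_rows-1 into contiguous half-open blocks.

    Hypothetical helper: returns a list of (start, stop) pairs, one per
    processor, whose sizes differ by at most one row.
    """
    base, extra = divmod(num_rows, num_processors)
    blocks = []
    start = 0
    for p in range(num_processors):
        # The first `extra` processors take one additional row each.
        size = base + (1 if p < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks


# Example: 10 image rows shared among 3 processors.
blocks = partition_rows(10, 3)
print(blocks)
```

Each processor would then carry out all computations for its own block and exchange only the intermediate results that cross a block boundary.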

We chose the data partitioning strategy for the BDFA to reduce data dependency
between processors, to reduce interprocessor communication, and to simplify the
interconnection network. In a BDFA, input (output) data can be moved directly into (from) any
processor without interfering with any other processors. Thus, in this scheme, the
interprocessor communications only involve the passing of intermediate computational results.

The  interprocessor  communications  for  the  BDFA  are  in  only  one  direction  by  design. 
This  permits  the  use  of  FIFO  buffers  for  interprocessor  data  communications.  The  FIFO 
buffers  also  provide  asynchronous  data  communication  capability  which  in  turn  relaxes  the 
requirement  for  strict  timing  between  processors.  This  is  an  important  advantage  as  we 
increase  the  number  of  processors  in  a  BDFA. 
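The one-directional FIFO coupling described above can be illustrated with a small software analogy. The sketch below, which assumes Python threads stand in for processors and `queue.Queue` objects stand in for the hardware FIFO buffers, passes whole blocks of intermediate results down a chain; the names `stage` and `run_chain`, the `None` end-of-stream sentinel, and the "add an offset" computation are all hypothetical placeholders.

```python
import queue
import threading


def stage(offset, inbox, outbox):
    """Hypothetical processor: read a block from the upstream FIFO, add a
    local contribution, and write the block to the downstream FIFO."""
    while True:
        block = inbox.get()          # blocks until upstream data is ready
        if block is None:            # sentinel: end of the data stream
            outbox.put(None)
            return
        outbox.put([x + offset for x in block])


def run_chain(blocks, offsets):
    """Connect len(offsets) stages with FIFOs that flow in one direction."""
    fifos = [queue.Queue() for _ in range(len(offsets) + 1)]
    workers = [
        threading.Thread(target=stage, args=(off, fifos[i], fifos[i + 1]))
        for i, off in enumerate(offsets)
    ]
    for w in workers:
        w.start()
    for b in blocks:
        fifos[0].put(b)
    fifos[0].put(None)
    results = []
    while True:
        out = fifos[-1].get()
        if out is None:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results


print(run_chain([[1, 2], [3, 4]], [10, 100]))
```

No stage needs a shared clock: each one proceeds as soon as its input block arrives, which is the sequencing-instead-of-timing property the FIFO buffers provide. Handshaking occurs once per block rather than once per word.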

In mapping a given algorithm to a BDFA system, we try to minimize the interprocessor
communications and data movements since they affect system throughput, system
efficiency and system management overhead. In addition, the BDFA maintains a direct input
data channel to each processor and a direct output data channel from each processor to
substantially reduce the required data movements for the processor array.


1.3  Overview  of  the  Report. 


This report presents a summary of the results achieved under Office of Naval Research
contract N00014-83-K-0138. In chapter 2, we develop the state space representations for the
2-D and the M-D discrete linear shift-invariant (DLSI) systems. We use these state space
representations to obtain the computational structure for real-time implementation of M-D
DLSI systems. In chapter 3, we present the architectural features of the BDFA. We also
present a performance evaluation of the use of the BDFA for 2-D digital filters. The
outstanding performance on this problem has encouraged us to consider the use of the BDFA
for other M-D digital signal processing problems [7], [8].


Chapter  2 


State  Space  Representation  of  M-D  Digital 
Systems. 


The  state  space  representation  provides  the  potential  for  minimizing  the  data  communication 
requirements  for  a  given  algorithm  without  increasing  computational  complexity.  Other 
advantages  of  the  state  space  implementation  over  direct  implementation  include  decreased 
sensitivity  to  parameter  variations  and  improved  performance  when  finite  arithmetic  is  used. 

A  set  of  finite  difference  equations  is  one  of  the  forms  commonly  used  for  representing 
DLSI  systems.  We  have  chosen  this  mathematical  abstraction  as  a  convenient  starting 
point  for  development  of  the  algorithm  decomposition  scheme  for  implementing  the  M-D 
DLSI  system.  The  first  step  in  the  procedure  is  the  state  space  representation  of  the  system. 
Although  we  show  the  development  of  the  state  space  representation  from  the  finite  difference 
equation,  we  can  also  obtain  a  state  space  representation  from  a  signal  flow  graph  or  a  block 
diagram  representation.  We  use  the  state  space  representation  as  an  intermediate  form  for 
representing  the  system.  In  order  to  clearly  explain  the  concepts  involved  in  this  approach, 
we  first  discuss  the  state  space  implementation  of  2-D  DLSI  systems.  We  then  show  that 
the concepts used in the 2-D case can be extended to the M-D case (M > 2).


2.1  State  Space  Representation  of  2-D  DLSI  Systems 


The general-order, causal 2-D finite difference equation with quarter-plane support is given
by [1]

$$
g(n_1,n_2) = \sum_{j_1=0}^{L_1}\sum_{j_2=0}^{L_2} b(j_1,j_2)\,f(n_1-j_1,n_2-j_2)
\;-\; \mathop{\sum_{j_1=0}^{L_1}\sum_{j_2=0}^{L_2}}_{j_1+j_2>0} a(j_1,j_2)\,g(n_1-j_1,n_2-j_2)
\tag{2.1}
$$


The parameters $a(j_1,j_2)$ and $b(j_1,j_2)$ in the above equation are coefficients which
determine the characteristics of the algorithm. Since the coefficients can take on arbitrary
values, this equation can represent many 2-D problems including spatial domain filters, image
processing, simulation, control systems, etc. The state space approach can be extended to the
2-D DLSI system [9], [10]. For the 1-D case, a similarity transformation can be used to
optimize the state-space representation for a given criterion. However, there is a fundamental
problem in extending this concept to the 2-D case because an arbitrary bivariate transfer
function cannot be factored into distinct poles and zeros and cannot be expanded into partial
fractions. Thus, these approaches to developing a parallel or cascade implementation are not
extendible to M-D systems due to the lack of a fundamental theorem of algebra for M-D systems.
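For reference, Eq. 2.1 can be evaluated directly by a nested loop over the coefficient support. The sketch below is one possible reading of that equation, assuming zero values outside the data array and arrays indexed as `f[n1][n2]`; the function name `filter_2d_direct` and the example coefficients are illustrative and do not come from the report.

```python
def filter_2d_direct(f, a, b):
    """Evaluate the 2-D difference equation (Eq. 2.1) with zero boundary values.

    f    : input samples, list of lists indexed f[n1][n2]
    a, b : (L1+1) x (L2+1) coefficient arrays indexed [j1][j2];
           a[0][0] is ignored because the feedback sum excludes j1 = j2 = 0.
    """
    N1, N2 = len(f), len(f[0])
    L1, L2 = len(b) - 1, len(b[0]) - 1
    g = [[0.0] * N2 for _ in range(N1)]
    for n1 in range(N1):
        for n2 in range(N2):
            acc = 0.0
            for j1 in range(L1 + 1):
                for j2 in range(L2 + 1):
                    m1, m2 = n1 - j1, n2 - j2
                    if m1 < 0 or m2 < 0:
                        continue            # outside the quarter plane
                    acc += b[j1][j2] * f[m1][m2]
                    if j1 + j2 > 0:
                        acc -= a[j1][j2] * g[m1][m2]   # past outputs only
            g[n1][n2] = acc
    return g


# Impulse response of a small first-order recursive filter.
impulse = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b = [[1.0, 0.0], [0.0, 0.0]]
a = [[1.0, -0.5], [-0.5, 0.0]]
g = filter_2d_direct(impulse, a, b)
print(g)
```

Every output sample here reaches back into previously computed rows and columns of `g`, which is the data communication burden that the state space form below is meant to localize.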

Roesser’s  state  space  model  for  2-D  DLSI  systems  is  perhaps  the  most  widely  accepted 
model  [9].  This  model  provides  for  the  update  of  the  next  state  for  a  set  of  vertical  state 
variables  and  a  set  of  horizontal  state  variables  as  a  linear  combination  of  the  present  vertical 
and  horizontal  state  variables  and  the  current  input.  The  output  is  also  a  linear  combination 
of  the  present  vertical  and  horizontal  state  variables  and  the  current  input. 

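$$
\begin{aligned}
\begin{bmatrix} S_H(n_1+1,n_2) \\ S_V(n_1,n_2+1) \end{bmatrix}
&= \begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix}
\begin{bmatrix} S_H(n_1,n_2) \\ S_V(n_1,n_2) \end{bmatrix}
+ \begin{bmatrix} B_1 \\ B_2 \end{bmatrix} f(n_1,n_2) \\
g(n_1,n_2)
&= \begin{bmatrix} C_1 & C_2 \end{bmatrix}
\begin{bmatrix} S_H(n_1,n_2) \\ S_V(n_1,n_2) \end{bmatrix}
+ D\,f(n_1,n_2)
\end{aligned}
\tag{2.2}
$$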

Roesser’s state space model is based upon assigning state variables to the output of the
delay elements. We find it more convenient to assign state variables to the input of the delay
elements. This makes the state space representation compatible with the eventual hardware
implementation because a state variable identifies a parameter which must be stored for later
use. This alternate choice for the state variables is equivalent to the parameter substitution:

$$
\begin{aligned}
Q_H(n_1,n_2) &= S_H(n_1+1,n_2) \\
Q_V(n_1,n_2) &= S_V(n_1,n_2+1)
\end{aligned}
\tag{2.3}
$$

With  this  substitution,  the  indices  for  the  modified  state  vector  are  the  same  as  those  for 
the  current  input.  Thus,  the  modified  model  is  conceptually  simpler  because  it  more  closely 
resembles  the  finite  difference  equation  model.  It  also  simplifies  our  later  derivations  for  the 
block-state  model  and  the  development  of  the  initial  conditions  models. 
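To see why the substitution of Eq. 2.3 leaves the model itself unchanged, one can invert it and substitute into Eq. 2.2. The following short derivation is a sketch of that step, using only the symbols already defined:

$$
\begin{aligned}
S_H(n_1,n_2) &= Q_H(n_1-1,n_2), & S_V(n_1,n_2) &= Q_V(n_1,n_2-1), \\
S_H(n_1+1,n_2) &= Q_H(n_1,n_2), & S_V(n_1,n_2+1) &= Q_V(n_1,n_2).
\end{aligned}
$$

The left-hand side of the state update in Eq. 2.2 therefore becomes $Q_H(n_1,n_2)$, $Q_V(n_1,n_2)$, and the state vector on the right-hand side becomes $Q_H(n_1-1,n_2)$, $Q_V(n_1,n_2-1)$. The coefficient matrices $A_1,\ldots,A_4$, $B_1$, $B_2$, $C_1$, $C_2$ and $D$ are not affected.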

We can combine the vertical state variables and the horizontal state variables into a
state vector for a given location in the M-D array. Thus,

$$
Q(n_1,n_2) = \begin{bmatrix} Q_H(n_1,n_2) \\ Q_V(n_1,n_2) \end{bmatrix}
\tag{2.4}
$$

We can then update this state vector and compute the current output using a linear
combination of the most recent vertical state variables, the most recent horizontal state
variables and the current input. The revised model is equivalent to Roesser’s original model.
However, the notation more accurately reflects the computational model and the resulting
architecture presented in this report.

The  modified  state  model  for  the  causal  2-D  DLSI  system  with  quarter  plane  support 
is  given  by 


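$$
\begin{aligned}
\begin{bmatrix} Q_H(n_1,n_2) \\ Q_V(n_1,n_2) \end{bmatrix}
&= \begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix}
\begin{bmatrix} Q_H(n_1-1,n_2) \\ Q_V(n_1,n_2-1) \end{bmatrix}
+ \begin{bmatrix} B_1 \\ B_2 \end{bmatrix} f(n_1,n_2) \\
g(n_1,n_2)
&= \begin{bmatrix} C_1 & C_2 \end{bmatrix}
\begin{bmatrix} Q_H(n_1-1,n_2) \\ Q_V(n_1,n_2-1) \end{bmatrix}
+ D\,f(n_1,n_2)
\end{aligned}
\tag{2.5}
$$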


In this equation, $Q_H(n_1,n_2)$ is a column vector whose elements are the current values of
the state variables for the horizontal processing direction corresponding to the index $n_1$.
$Q_V(n_1,n_2)$ is a column vector whose elements are the current values of the state variables
for the vertical processing direction corresponding to the index $n_2$. The index $(n_1-1,n_2)$
implies a delay in the horizontal direction and the index $(n_1,n_2-1)$ implies a delay in
the vertical direction. $A_1$, $A_2$, $A_3$, $A_4$, $B_1$, $B_2$, $C_1$, $C_2$, and $D$ are
appropriate coefficient matrices such that Eq. 2.1 and Eq. 2.5 are equivalent. Fig. 2.1 gives
a block diagram for a linear finite state machine which is equivalent to the state space
representation of the 2-D DLSI system given in Eq. 2.5. The linear finite state machine for
Eq. 2.2 is identical except for the designation of variables as specified in Eq. 2.3.


A  state  variable  represents  information  that  must  be  stored  for  later  use.  Therefore, 
it  is  important  to  select  state  variables  that  minimize  data  communication  requirements 
without  increasing  computational  complexity.  In  a  typical  image  processing  application,  a 
horizontal  delay  represents  a  storage  of  one  word  while  a  vertical  delay  represents  a  storage 
of  an  entire  row  of  data.  Therefore  we  selected  a  canonical  form  which  minimizes  the  number 
of  vertical  state  variables. 


The state space representation for a given 2-D DLSI system is not unique. In addition,
the problem of defining a representation with the minimum number of states has not
been solved [11]. We choose a particular canonical form to facilitate the development of a
computational primitive for 2-D DLSI systems. We then assign state variables to the inputs
of the delay elements to obtain the state variable representation. The procedures which we
use are general and can be applied to obtain a state space representation from any signal
flow graph or block diagram.

Figure 2.1: Two-dimensional generalized finite state machine.


2.1.1  Deriving  the  2-D  State  Space  Equations. 


The  2-D  transfer  function  corresponding  to  Eq.  2.1  is  given  by 

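$$
H(z_1,z_2) = \frac{\displaystyle\sum_{j_1=0}^{L_1}\sum_{j_2=0}^{L_2} b(j_1,j_2)\,z_1^{-j_1} z_2^{-j_2}}
{\displaystyle 1 + \mathop{\sum_{j_1=0}^{L_1}\sum_{j_2=0}^{L_2}}_{j_1+j_2>0} a(j_1,j_2)\,z_1^{-j_1} z_2^{-j_2}}
\tag{2.6}
$$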


Note that $H(z_1,z_2)$ describes an input/output relationship between the transform of the
input sequence, $F(z_1,z_2)$, and the transform of the output sequence, $G(z_1,z_2)$. We can
show this relationship as follows:

$$
G(z_1,z_2) = b(0,0)\,F(z_1,z_2)
+ \mathop{\sum_{j_1=0}^{L_1}\sum_{j_2=0}^{L_2}}_{j_1+j_2>0}
\bigl[\,b(j_1,j_2)\,F(z_1,z_2) - a(j_1,j_2)\,G(z_1,z_2)\,\bigr]\, z_1^{-j_1} z_2^{-j_2}
\tag{2.7}
$$

Fig. 2.2 gives a block diagram representation of a 2-D filter partitioned as specified by Eq.
2.7. Note that the number of vertical delays is the same as the order of the filter in the
$z_2$ variable, which is the minimum possible number, as desired. We can obtain the desired
state space representation by assigning a horizontal state variable to the input of each of the
horizontal delay blocks (associated with the $z_1$ variable) and a vertical state variable to
each of the vertical delay blocks (associated with the $z_2$ variable). We then write the
resulting equations in matrix form as given in Eq. 2.5.

Fig. 2.3 gives a section of the block diagram of the 2-D DLSI system having one
horizontal delay and one vertical delay.

Figure 2.3: Section of the block diagram of a 2-D DLSI system.

Assigning state variables as described above, the typical vertical state equation for the
2-D DLSI system can be represented as

$$
\begin{aligned}
q_{2,I_2}(n_1,n_2) &= b(0,j_2)\,f(n_1,n_2) - a(0,j_2)\,g(n_1,n_2) \\
&\quad + q_{1,I_1}(n_1-1,n_2) + q_{2,I_2+1}(n_1,n_2-1)\,; \qquad 1 \le j_2 \le L_2 \\
I_1 &= L_1 j_2 + 1 \\
I_2 &= j_2 \\
q_{2,L_2+1}(n_1,n_2-1) &= 0
\end{aligned}
\tag{2.8}
$$

In a similar way, the typical horizontal state variable is given by

$$
\begin{aligned}
q_{1,I_1}(n_1,n_2) &= b(j_1,j_2)\,f(n_1,n_2) - a(j_1,j_2)\,g(n_1,n_2) + q_{1,I_1+1}(n_1-1,n_2)\,; \\
&\qquad 1 \le j_1 \le L_1 - 1\,; \quad 0 \le j_2 \le L_2 \\
I_1 &= j_2 L_1 + j_1
\end{aligned}
\tag{2.9}
$$

$$
\begin{aligned}
q_{1,I_1}(n_1,n_2) &= b(j_1,j_2)\,f(n_1,n_2) - a(j_1,j_2)\,g(n_1,n_2)\,; \\
&\qquad j_1 = L_1\,; \quad 0 \le j_2 \le L_2 \\
I_1 &= (j_2 + 1) L_1
\end{aligned}
\tag{2.10}
$$

The output equation is given by

$$
g(n_1,n_2) = b(0,0)\,f(n_1,n_2) + q_{1,1}(n_1-1,n_2) + q_{2,1}(n_1,n_2-1)
\tag{2.11}
$$


The vertical and horizontal state variables can then be represented by [12]

$$
q_{i,I_i}(n_1,n_2) = b_{i,I_i}\,f(n_1,n_2) - a_{i,I_i}\,g(n_1,n_2)
+ q_{i,I_i+1}(n_i-1,n_{\bar\imath}) + q_{\bar\imath,I_{\bar\imath}}(n_i,n_{\bar\imath}-1)
\tag{2.12}
$$

If $i = 1$ then $\bar\imath = 2$ and vice versa. Note that if $i = 1$ in Eq. 2.12, then the
corresponding vertical state variable is equal to zero
$[\,q_{\bar\imath,I_{\bar\imath}}(n_i,n_{\bar\imath}-1) = 0\,]$.


Eq. 2.12 is a computational primitive for the 2-D DLSI system since the vertical state
variables, the horizontal state variables and the output can be mapped into this equation
with a suitable interchange of variables. In using Eq. 2.12 as a computational primitive,
$q_{i,I_i}(n_1,n_2)$ can represent the current value of the horizontal or vertical state
variable or the output as appropriate, $q_{i,I_i+1}(n_i-1,n_{\bar\imath})$ represents a
previous value of a horizontal state variable (delayed by one pixel), and
$q_{\bar\imath,I_{\bar\imath}}(n_i,n_{\bar\imath}-1)$ represents a previous value of a vertical
state variable (delayed by one row). We can implement this equation in a tree structure using
two multipliers and three adders [12] as shown in Fig. 2.4 or in two steps using a
multiplier-adder.
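The arithmetic content of the primitive can be sketched in a few lines. The function below is an illustrative rendering of Eq. 2.12 and not the hardware design; the name `primitive` and its argument names are assumptions. The grouping of operations mirrors a two-level tree with two products and three additions or subtractions.

```python
def primitive(b_coef, a_coef, f_now, g_now, q_same_delayed, q_other_delayed=0.0):
    """One evaluation of the 2-D computational primitive (Eq. 2.12).

    b_coef, a_coef  : coefficients b_{i,I_i} and a_{i,I_i}
    f_now, g_now    : current input and current output sample
    q_same_delayed  : delayed state variable from the same tuple
    q_other_delayed : delayed state variable from the other tuple
                      (zero for a horizontal state variable)
    """
    forward = b_coef * f_now        # multiplier 1
    feedback = a_coef * g_now       # multiplier 2
    # Three adders arranged as a tree: two first-level sums, one final sum.
    return (forward - feedback) + (q_same_delayed + q_other_delayed)


print(primitive(2.0, 0.5, 3.0, 4.0, 1.0, 0.25))
```

The two products are independent of each other, as are the two first-level sums, so a tree evaluates the primitive in three sequential steps; a single multiplier-adder would take the two-step route mentioned above at the cost of reusing the same unit.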


2.2  State  Space  Representation  of  M-D  DLSI  Systems.

We  now  discuss  the  extension  of  the  2-D  state  space  implementation  presented  above  to  M-D 
DLSI  systems.  The  general  multivariable  difference  equation  for  the  causal,  DLSI  system 
with  first  section  support  (the  M-D  equivalent  of  quarter  plane  support)  is  given  by  [1] 

$$
\begin{aligned}
g(\mathbf{n}) &= \sum_{j_1=0}^{L_1}\sum_{j_2=0}^{L_2}\cdots\sum_{j_M=0}^{L_M}
b(J)\,f(n_1-j_1,\ldots,n_M-j_M) \\
&\quad - \mathop{\sum_{j_1=0}^{L_1}\sum_{j_2=0}^{L_2}\cdots\sum_{j_M=0}^{L_M}}_{j_1+j_2+\cdots+j_M>0}
a(J)\,g(n_1-j_1,\ldots,n_M-j_M)
\end{aligned}
\tag{2.13}
$$

Figure 2.4: Tree structure for the computational primitive.


The input $f(\mathbf{n})$ is assumed to be sampled at uniform intervals in each of the
independent variables and $g(\mathbf{n})$ is the corresponding output. The parameters $a(J)$
and $b(J)$ are coefficients which determine the characteristics of the algorithm. Since the
coefficients can take on arbitrary values as appropriate, this equation can represent many
common M-D problems.


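The state space representation of the M-D DLSI system is given by [13]

$$
\begin{aligned}
\begin{bmatrix}
S_1(n_1+1,n_2,\ldots,n_M) \\
S_2(n_1,n_2+1,\ldots,n_M) \\
\vdots \\
S_M(n_1,n_2,\ldots,n_M+1)
\end{bmatrix}
&=
\begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1M} \\
A_{21} & A_{22} & \cdots & A_{2M} \\
\vdots & \vdots & & \vdots \\
A_{M1} & A_{M2} & \cdots & A_{MM}
\end{bmatrix}
\begin{bmatrix}
S_1(\mathbf{n}) \\ S_2(\mathbf{n}) \\ \vdots \\ S_M(\mathbf{n})
\end{bmatrix}
+
\begin{bmatrix}
B_1 \\ B_2 \\ \vdots \\ B_M
\end{bmatrix}
f(\mathbf{n}) \\
g(\mathbf{n})
&=
\begin{bmatrix} C_1 & C_2 & \cdots & C_M \end{bmatrix}
\begin{bmatrix}
S_1(\mathbf{n}) \\ S_2(\mathbf{n}) \\ \vdots \\ S_M(\mathbf{n})
\end{bmatrix}
+ D\,f(\mathbf{n})
\end{aligned}
\tag{2.14}
$$

We choose state variables for the M-D case in the same manner as we did with the 2-D case.
This is equivalent to the following parameter substitution:

$$
Q_i(\mathbf{n}) = S_i(n_1,\ldots,n_i+1,\ldots,n_M)\,; \qquad 1 \le i \le M.
\tag{2.15}
$$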

With this substitution, the indices for the modified state vectors are the same as those for
the current input and the state variables are updated as a linear combination of the delayed
state variables and the current input. Thus, the theoretical model more closely matches the
computational model and is consistent with the difference equation notation. The resulting
state space representation for the M-D DLSI system is given by

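$$
\begin{aligned}
\begin{bmatrix}
Q_1(\mathbf{n}) \\ Q_2(\mathbf{n}) \\ \vdots \\ Q_M(\mathbf{n})
\end{bmatrix}
&=
\begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1M} \\
A_{21} & A_{22} & \cdots & A_{2M} \\
\vdots & \vdots & & \vdots \\
A_{M1} & A_{M2} & \cdots & A_{MM}
\end{bmatrix}
\begin{bmatrix}
Q_1(n_1-1,n_2,\ldots,n_M) \\
Q_2(n_1,n_2-1,\ldots,n_M) \\
\vdots \\
Q_M(n_1,n_2,\ldots,n_M-1)
\end{bmatrix}
+
\begin{bmatrix}
B_1 \\ B_2 \\ \vdots \\ B_M
\end{bmatrix}
f(\mathbf{n}) \\
g(\mathbf{n})
&=
\begin{bmatrix} C_1 & C_2 & \cdots & C_M \end{bmatrix}
\begin{bmatrix}
Q_1(n_1-1,n_2,\ldots,n_M) \\
Q_2(n_1,n_2-1,\ldots,n_M) \\
\vdots \\
Q_M(n_1,n_2,\ldots,n_M-1)
\end{bmatrix}
+ D\,f(\mathbf{n})
\end{aligned}
\tag{2.16}
$$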

We can use the approach we used for the 2-D DLSI system to obtain a state space
representation of the M-D DLSI system. First, we obtain a suitable computational graph for the
M-D system. Then we assign the input to each delay as a state variable in the corresponding
tuple. In the development of the M-D state space implementation that follows, we have
chosen a canonical form which minimizes the number of state variables in the Mth tuple.
This is comparable to choosing a canonical form to minimize the number of vertical state
variables in the 2-D DLSI system.

If  we  express  the  state  equations  and  the  output  equation  for  the  M-D  system  in 
matrix  form,  then  we  have  the  M-D  state  space  model  as  given  in  Eq.  2.16.  For  convenience, 
we  define 


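$$
C = \begin{bmatrix} C_1 & C_2 & \cdots & C_M \end{bmatrix},
\qquad
Q(\mathbf{n}) = \begin{bmatrix} Q_1(\mathbf{n}) \\ Q_2(\mathbf{n}) \\ \vdots \\ Q_M(\mathbf{n}) \end{bmatrix}
$$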
where ki^i and may either be 0 or 1 depending upon whether the associated delayed
state variable appears in the equation for state variable qi,/_i(n) or qi,i„ respectively.

Eq. 2.30 is a computational primitive for the M-D DLSI system since the state
variables for each tuple and the output can be mapped into it with a suitable interchange
of variables. Also note that if $a_{i,I_i}$ is equal to zero, then Eq. 2.30 becomes a
computational primitive for the M-D FIR DLSI system. On the left side of the equal sign,
$q_{i,I_i}(\mathbf{n})$ can represent the current value of the state variable in any tuple or
the output as appropriate. The state variables on the right side of the equal sign are delayed
by one element in the respective tuple. Thus, Eq. 2.30 is a generalization of the 2-D
computational primitive as given in Eq. 2.12. Note that only 2 multiplications and a maximum
of M + 1 additions are required to compute any of the state variables or the output for an
M-D system.


2.3  A  2-D  Example. 

Consider  the  second  order  2-D  digital  filter  with  transfer  function  given  by 

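$$
H(z_1,z_2) = \frac{\displaystyle\sum_{j_1=0}^{2}\sum_{j_2=0}^{2} b(j_1,j_2)\,z_1^{-j_1} z_2^{-j_2}}
{\displaystyle 1 + \mathop{\sum_{j_1=0}^{2}\sum_{j_2=0}^{2}}_{j_1+j_2>0} a(j_1,j_2)\,z_1^{-j_1} z_2^{-j_2}}
\tag{2.31}
$$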

Using  Eq.  2.30,  we  can  write  state  equations  as  follows: 


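$$
\begin{aligned}
y(n_1,n_2) &= q_{1,1}(n_1-1,n_2) + q_{2,1}(n_1,n_2-1) \\
g(n_1,n_2) &= b_{0,0}\,f(n_1,n_2) + y(n_1,n_2) \\
q_{1,1}(n_1,n_2) &= b_{1,1}\,f(n_1,n_2) + a_{1,1}\,y(n_1,n_2) + q_{1,2}(n_1-1,n_2) \\
q_{1,2}(n_1,n_2) &= b_{1,2}\,f(n_1,n_2) + a_{1,2}\,y(n_1,n_2) \\
q_{1,3}(n_1,n_2) &= b_{1,3}\,f(n_1,n_2) + a_{1,3}\,y(n_1,n_2) + q_{1,4}(n_1-1,n_2) \\
q_{1,4}(n_1,n_2) &= b_{1,4}\,f(n_1,n_2) + a_{1,4}\,y(n_1,n_2) \\
q_{1,5}(n_1,n_2) &= b_{1,5}\,f(n_1,n_2) + a_{1,5}\,y(n_1,n_2) + q_{1,6}(n_1-1,n_2) \\
q_{1,6}(n_1,n_2) &= b_{1,6}\,f(n_1,n_2) + a_{1,6}\,y(n_1,n_2) \\
q_{2,1}(n_1,n_2) &= b_{2,1}\,f(n_1,n_2) + a_{2,1}\,y(n_1,n_2) + q_{1,3}(n_1-1,n_2) + q_{2,2}(n_1,n_2-1) \\
q_{2,2}(n_1,n_2) &= b_{2,2}\,f(n_1,n_2) + a_{2,2}\,y(n_1,n_2) + q_{1,5}(n_1-1,n_2)
\end{aligned}
\tag{2.32}
$$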
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The  coefficients  are  given  by 


f>0,Q 

Q 

fli.i 

il,2 

Ol,2 

^1,3 

fll.3 

C4 

^1,4 

Cs 

fll.S 

^1,6 

^1,6 

f*2,l 

024 

^2,2 

024 

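Thus, with the state vector ordered as
$\mathbf{Q} = [\,q_{1,1}\; q_{1,2}\; q_{1,3}\; q_{1,4}\; q_{1,5}\; q_{1,6}\; q_{2,1}\; q_{2,2}\,]^T$,
we can write

\[
\mathbf{A}_1 =
\begin{bmatrix}
a_{1,1} & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_{1,2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_{1,3} & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
a_{1,4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_{1,5} & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
a_{1,6} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_{2,1} & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
a_{2,2} & 0 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}
\tag{2.34}
\]

\[
\mathbf{A}_2 =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & a_{1,1} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & a_{1,2} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & a_{1,3} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & a_{1,4} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & a_{1,5} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & a_{1,6} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & a_{2,1} & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & a_{2,2} & 0
\end{bmatrix}
\tag{2.35}
\]

\[
\mathbf{B} = [\,b_{1,1}\; b_{1,2}\; b_{1,3}\; b_{1,4}\; b_{1,5}\; b_{1,6}\; b_{2,1}\; b_{2,2}\,]^T
\tag{2.36}
\]

\[
\mathbf{C}_1 = [\,1\; 0\; 0\; 0\; 0\; 0\; 0\; 0\,]
\tag{2.37}
\]

\[
\mathbf{C}_2 = [\,0\; 0\; 0\; 0\; 0\; 0\; 1\; 0\,]
\tag{2.38}
\]

\[
\mathbf{D} = [\,b(0,0)\,]
\tag{2.39}
\]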
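
The construction of the coefficient matrices in this example can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: the function name `build_state_space`, the dictionary layout of the coefficients, and the state ordering q11, q12, q13, q14, q15, q16, q21, q22 are assumptions made for the sketch. The numerical values are the four-decimal coefficients of the example filter used in Section 2.4.1.

```python
def build_state_space(a, b):
    """Return (A1, A2, B, C1, C2, D) for a second-order 2-D filter.

    a and b map a delay pair (i1, i2) to a coefficient, with a[(0, 0)] == 1.
    Assumed state ordering: q11, q12, q13, q14, q15, q16, q21, q22.
    """
    # Delay pair (i1, i2) associated with each state, in state order.
    taps = [(1, 0), (2, 0), (1, 1), (2, 1), (1, 2), (2, 2), (0, 1), (0, 2)]
    n = len(taps)
    a_vec = [-a[t] for t in taps]                      # a_{k,l} = -a(i1, i2)
    b_vec = [b[t] - b[(0, 0)] * a[t] for t in taps]    # b_{k,l}

    A1 = [[0.0] * n for _ in range(n)]
    A2 = [[0.0] * n for _ in range(n)]
    for row in range(n):
        A1[row][0] = a_vec[row]   # feedback of y through q11(n1-1, n2)
        A2[row][6] = a_vec[row]   # feedback of y through q21(n1, n2-1)
    # Unit entries for states delayed along n1.
    for row, col in [(0, 1), (2, 3), (4, 5), (6, 2), (7, 4)]:
        A1[row][col] = 1.0
    A2[6][7] = 1.0                # q21 uses q22 delayed along n2

    C1 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    C2 = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    return A1, A2, b_vec, C1, C2, b[(0, 0)]


# Coefficients of the example filter of Section 2.4.1, rounded to four decimals.
a = {(0, 0): 1.0, (1, 0): -0.3695, (2, 0): 0.1958,
     (0, 1): -0.3695, (1, 1): 0.1366, (2, 1): -0.0724,
     (0, 2): 0.1958, (1, 2): -0.0724, (2, 2): 0.0383}
b = {(0, 0): 0.0427, (1, 0): 0.0853, (2, 0): 0.0427,
     (0, 1): 0.0853, (1, 1): 0.1707, (2, 1): 0.0853,
     (0, 2): 0.0427, (1, 2): 0.0853, (2, 2): 0.0427}

A1, A2, B, C1, C2, D = build_state_space(a, b)
```

Each matrix is sparse: apart from one column of denominator coefficients, only a handful of unit entries are nonzero, which is what keeps the per-state cost at two multiplications.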

2.4  Computation  of  Initial  Conditions. 

Many  practical  digital  systems  require  the  use  of  appropriate  boundary  values  or  initial 
conditions.  The  classical  approach  of  assigning  a  value  of  zero  to  boundary  values  often 
leads  to  undesirable  transients  during  initialization.  Our  approach  to  the  initial  condition 
problem  involves  the  state  space  model  and  the  estimation  of  the  initial  state  using  the 
constraint  that  the  state  does  not  change  upon  applying  the  initial  inputs  on  the  boundary. 
Using  this  constraint,  the  initial  state  can  be  determined  from  the  relationship 

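\[
\mathbf{Q}(\mathbf{0}) = \mathbf{Q}(-1,0,\ldots,0) = \mathbf{Q}(0,-1,0,\ldots,0) = \cdots = \mathbf{Q}(0,\ldots,0,-1)
\tag{2.40}
\]

It follows from the use of Eq. 2.22 that

\[
\begin{aligned}
\mathbf{Q}(\mathbf{0}) &= \sum_{i=1}^{M} \mathbf{A}_i\,\mathbf{Q}(\mathbf{0}) + \mathbf{B} f(\mathbf{0}) \\
g(\mathbf{0}) &= \sum_{i=1}^{M} \mathbf{C}_i\,\mathbf{Q}(\mathbf{0}) + \mathbf{D} f(\mathbf{0})
\end{aligned}
\tag{2.41}
\]

Since

\[
\sum_{i=1}^{M} \mathbf{A}_i = \mathbf{A},
\tag{2.42}
\]

and writing $\mathbf{C}$ for the corresponding sum of the $\mathbf{C}_i$, we can write

\[
\begin{aligned}
\mathbf{Q}(\mathbf{0}) &= [\mathbf{I} - \mathbf{A}]^{-1}\,\mathbf{B} f(\mathbf{0}) \\
g(\mathbf{0}) &= \left[\mathbf{C}\,[\mathbf{I} - \mathbf{A}]^{-1}\,\mathbf{B} + \mathbf{D}\right] f(\mathbf{0})
\end{aligned}
\tag{2.43}
\]

Thus, we can compute the initial state vector and the initial output by using only the initial
input.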
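
As a rough illustration of Eq. 2.43, the sketch below solves for the initial state and output of a small system. The helper names `solve` and `initial_conditions` and the two-state toy matrices are hypothetical, chosen only so that the result is easy to check by hand; they do not come from the report.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def initial_conditions(A_list, B, C_list, D, f0):
    """Return (Q0, g0) as in Eq. 2.43 for the initial input sample f0."""
    n = len(B)
    # A is the sum of the A_i; C is the sum of the C_i.
    A = [[sum(Ai[r][c] for Ai in A_list) for c in range(n)] for r in range(n)]
    C = [sum(Ci[c] for Ci in C_list) for c in range(n)]
    I_minus_A = [[(1.0 if r == c else 0.0) - A[r][c] for c in range(n)]
                 for r in range(n)]
    Q0 = solve(I_minus_A, [bi * f0 for bi in B])        # [I - A]^{-1} B f(0)
    g0 = sum(ci * qi for ci, qi in zip(C, Q0)) + D * f0
    return Q0, g0


# Hypothetical two-state 2-D system (values chosen only for illustration).
A1 = [[0.5, 0.0], [0.0, 0.0]]
A2 = [[0.0, 0.0], [0.0, 0.25]]
B = [1.0, 3.0]
C1 = [1.0, 0.0]
C2 = [0.0, 1.0]
D = 0.5
Q0, g0 = initial_conditions([A1, A2], B, [C1, C2], D, f0=2.0)
```

For this toy system the state satisfies the no-change constraint Q0 = A Q0 + B f0, which is the property that suppresses the start-up transient.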


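Using the constraint above on the initial conditions, we can compute the state and the
output along any boundary. Consider the use of this constraint along the boundary where
the index for tuple k has a value of zero. For this case, we have

\[
\mathbf{Q}(n_1,n_2,\ldots,n_k,\ldots,n_M) = \mathbf{Q}(n_1,n_2,\ldots,n_k-1,\ldots,n_M)\;;\quad n_k = 0
\tag{2.44}
\]

Let $\mathbf{e}_i$ denote the M-tuple with a one in position i and zeros elsewhere, so that
$\mathbf{n}-\mathbf{e}_i$ is the index $\mathbf{n}$ with tuple i delayed by one element.
The state equation for this boundary is given by

\[
\mathbf{Q}(\mathbf{n}) = \mathbf{A}_k\,\mathbf{Q}(\mathbf{n})
+ \sum_{\substack{i=1 \\ i\neq k}}^{M} \mathbf{A}_i\,\mathbf{Q}(\mathbf{n}-\mathbf{e}_i)
+ \mathbf{B} f(\mathbf{n})\;;\quad n_k = 0
\tag{2.45}
\]

Thus,

\[
\mathbf{Q}(\mathbf{n}) = [\mathbf{I} - \mathbf{A}_k]^{-1}
\sum_{\substack{i=1 \\ i\neq k}}^{M} \mathbf{A}_i\,\mathbf{Q}(\mathbf{n}-\mathbf{e}_i)
+ [\mathbf{I} - \mathbf{A}_k]^{-1}\,\mathbf{B} f(\mathbf{n})\;;\quad n_k = 0
\tag{2.46}
\]

The corresponding output equation is given by

\[
\begin{aligned}
g(\mathbf{n}) ={}& \sum_{\substack{i=1 \\ i\neq k}}^{M} \mathbf{C}_i\,\mathbf{Q}(\mathbf{n}-\mathbf{e}_i)
+ \mathbf{C}_k\,[\mathbf{I} - \mathbf{A}_k]^{-1}
\sum_{\substack{i=1 \\ i\neq k}}^{M} \mathbf{A}_i\,\mathbf{Q}(\mathbf{n}-\mathbf{e}_i) \\
&+ \mathbf{C}_k\,[\mathbf{I} - \mathbf{A}_k]^{-1}\,\mathbf{B} f(\mathbf{n}) + \mathbf{D} f(\mathbf{n})
\end{aligned}
\tag{2.47}
\]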

2.4.1  Initial  Conditions  Example. 


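We now show the computation of initial conditions for a second-order 2-D IIR filter as an
example. We derived the coefficient matrices for this case in the example of Section 2.3.
The state space representation for this filter is given by

\[
\begin{aligned}
\mathbf{Q}(n_1,n_2) &= \mathbf{A}_1\,\mathbf{Q}(n_1-1,n_2) + \mathbf{A}_2\,\mathbf{Q}(n_1,n_2-1) + \mathbf{B} f(n_1,n_2) \\
g(n_1,n_2) &= \mathbf{C}_1\,\mathbf{Q}(n_1-1,n_2) + \mathbf{C}_2\,\mathbf{Q}(n_1,n_2-1) + \mathbf{D} f(n_1,n_2)
\end{aligned}
\tag{2.48}
\]

Let the numerator polynomial for the 2-D transfer function H(z) be given by

\[
\begin{aligned}
N(z_1,z_2) ={}& 0.0427 + 0.0853\,z_1^{-1} + 0.0427\,z_1^{-2} \\
&+ 0.0853\,z_2^{-1} + 0.1707\,z_1^{-1}z_2^{-1} + 0.0853\,z_1^{-2}z_2^{-1} \\
&+ 0.0427\,z_2^{-2} + 0.0853\,z_1^{-1}z_2^{-2} + 0.0427\,z_1^{-2}z_2^{-2}
\end{aligned}
\tag{2.49}
\]

Let the denominator polynomial for H(z) be given by

\[
\begin{aligned}
D(z_1,z_2) ={}& 1.0 - 0.3695\,z_1^{-1} + 0.1958\,z_1^{-2} \\
&- 0.3695\,z_2^{-1} + 0.1366\,z_1^{-1}z_2^{-1} - 0.0724\,z_1^{-2}z_2^{-1} \\
&+ 0.1958\,z_2^{-2} - 0.0724\,z_1^{-1}z_2^{-2} + 0.0383\,z_1^{-2}z_2^{-2}
\end{aligned}
\tag{2.50}
\]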

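The coefficient matrices corresponding to Eq. 2.22 are given by

\[
\mathbf{A} =
\begin{bmatrix}
 0.3695 & 1 & 0 & 0 & 0 & 0 &  0.3695 & 0 \\
-0.1958 & 0 & 0 & 0 & 0 & 0 & -0.1958 & 0 \\
-0.1366 & 0 & 0 & 1 & 0 & 0 & -0.1366 & 0 \\
 0.0724 & 0 & 0 & 0 & 0 & 0 &  0.0724 & 0 \\
 0.0724 & 0 & 0 & 0 & 0 & 1 &  0.0724 & 0 \\
-0.0383 & 0 & 0 & 0 & 0 & 0 & -0.0383 & 0 \\
 0.3695 & 0 & 1 & 0 & 0 & 0 &  0.3695 & 1 \\
-0.1958 & 0 & 0 & 0 & 1 & 0 & -0.1958 & 0
\end{bmatrix}
\tag{2.51}
\]

\[
\mathbf{A}_1 =
\begin{bmatrix}
 0.3695 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-0.1958 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-0.1366 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
 0.0724 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 0.0724 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
-0.0383 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 0.3695 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
-0.1958 & 0 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}
\tag{2.52}
\]

\[
\mathbf{A}_2 =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 &  0.3695 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -0.1958 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -0.1366 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 &  0.0724 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 &  0.0724 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -0.0383 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 &  0.3695 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & -0.1958 & 0
\end{bmatrix}
\tag{2.53}
\]
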

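The initial state vector at n_1 = n_2 = 0 is given by

\[
\mathbf{Q}(0,0) = [\mathbf{I} - \mathbf{A}]^{-1}\,\mathbf{B} f(0,0)
\tag{2.54}
\]

The corresponding initial output is given by

\[
g(0,0) = \left[[\mathbf{C}_1 + \mathbf{C}_2]\,[\mathbf{I} - \mathbf{A}]^{-1}\,\mathbf{B} + \mathbf{D}\right] f(0,0)
\tag{2.55}
\]

Thus,

\[
g(0,0) = f(0,0)
\tag{2.56}
\]

where

\[
[\mathbf{C}_1 + \mathbf{C}_2]\,[\mathbf{I} - \mathbf{A}]^{-1}\,\mathbf{B} + \mathbf{D} = 1.0
\tag{2.57}
\]

For the first row, we have

\[
\mathbf{Q}(n_1,-1) = \mathbf{Q}(n_1,0)
\tag{2.58}
\]

Using this assumption, we obtain

\[
\mathbf{Q}(n_1,0) = [\mathbf{I} - \mathbf{A}_2]^{-1}\,\mathbf{A}_1\,\mathbf{Q}(n_1-1,0) + [\mathbf{I} - \mathbf{A}_2]^{-1}\,\mathbf{B} f(n_1,0)
\tag{2.59}
\]

The corresponding output equation is obtained by substituting Eq. 2.59 for
$\mathbf{Q}(n_1,-1) = \mathbf{Q}(n_1,0)$ in the output equation of Eq. 2.48:

\[
g(n_1,0) = \left[\mathbf{C}_1 + \mathbf{C}_2\,[\mathbf{I} - \mathbf{A}_2]^{-1}\,\mathbf{A}_1\right]\mathbf{Q}(n_1-1,0)
+ \left[\mathbf{C}_2\,[\mathbf{I} - \mathbf{A}_2]^{-1}\,\mathbf{B} + \mathbf{D}\right] f(n_1,0)
\tag{2.60}
\]

Let

\[
\mathbf{A}_h = [\mathbf{I} - \mathbf{A}_2]^{-1}\,\mathbf{A}_1 ,
\tag{2.61}
\]

\[
\mathbf{B}_h = [\mathbf{I} - \mathbf{A}_2]^{-1}\,\mathbf{B} ,
\tag{2.62}
\]

\[
\mathbf{C}_h = \mathbf{C}_1 + \mathbf{C}_2\,[\mathbf{I} - \mathbf{A}_2]^{-1}\,\mathbf{A}_1 ,
\tag{2.63}
\]

and

\[
\mathbf{D}_h = \mathbf{C}_2\,[\mathbf{I} - \mathbf{A}_2]^{-1}\,\mathbf{B} + \mathbf{D}
\tag{2.64}
\]

Then, the state space representation for the first row can be written as

\[
\begin{aligned}
\mathbf{Q}(n_1,0) &= \mathbf{A}_h\,\mathbf{Q}(n_1-1,0) + \mathbf{B}_h f(n_1,0) \\
g(n_1,0) &= \mathbf{C}_h\,\mathbf{Q}(n_1-1,0) + \mathbf{D}_h f(n_1,0)
\end{aligned}
\tag{2.65}
\]

For our example, we have

\[
\mathbf{A}_h =
\begin{bmatrix}
 0.44721360 & 1 &  0.44721360 & 0 &  0.44721360 & 0 & 0 & 0 \\
-0.23698230 & 0 & -0.23698230 & 0 & -0.23698230 & 0 & 0 & 0 \\
-0.16525767 & 0 & -0.16525767 & 1 & -0.16525767 & 0 & 0 & 0 \\
 0.08757145 & 0 &  0.08757145 & 0 &  0.08757145 & 0 & 0 & 0 \\
 0.08757145 & 0 &  0.08757145 & 0 &  0.08757145 & 1 & 0 & 0 \\
-0.04640486 & 0 & -0.04640486 & 0 & -0.04640486 & 0 & 0 & 0 \\
 0.21023130 & 0 &  1.21023130 & 0 &  1.21023130 & 0 & 0 & 0 \\
-0.23698230 & 0 & -0.23698230 & 0 &  0.76301770 & 0 & 0 & 0
\end{bmatrix}
\tag{2.66}
\]

\[
\mathbf{B}_h =
\begin{bmatrix}
0.16167809 \\ 0.00222197 \\ 0.14248059 \\ 0.10029146 \\ 0.10029146 \\ 0.03475127 \\ 0.16390006 \\ 0.00222197
\end{bmatrix}
\tag{2.67}
\]

\[
\mathbf{C}_h = [\,1.2102313\;\; 0\;\; 1.2102313\;\; 0\;\; 1.2102313\;\; 0\;\; 0\;\; 0\,]
\tag{2.68}
\]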
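
The first-row matrices can be checked numerically. The following sketch, which assumes the four-decimal coefficients of A1 and A2 listed in Eqs. 2.52 and 2.53 and uses hypothetical helper functions `inverse` and `matmul`, evaluates Ah = [I - A2]^{-1} A1 and Ch = C1 + C2 [I - A2]^{-1} A1. Because the inputs are rounded to four decimals, the results should be expected to agree with the printed eight-decimal values only to roughly three decimal places.

```python
def inverse(M):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[r]) + [1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


n = 8
# Column of a_{k,l} values shared by A1 (column 1) and A2 (column 7).
a_col = [0.3695, -0.1958, -0.1366, 0.0724, 0.0724, -0.0383, 0.3695, -0.1958]
A1 = [[0.0] * n for _ in range(n)]
A2 = [[0.0] * n for _ in range(n)]
for r in range(n):
    A1[r][0] = a_col[r]
    A2[r][6] = a_col[r]
for r, c in [(0, 1), (2, 3), (4, 5), (6, 2), (7, 4)]:
    A1[r][c] = 1.0
A2[6][7] = 1.0
C1 = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
C2 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]]

I_minus_A2 = [[(1.0 if r == c else 0.0) - A2[r][c] for c in range(n)]
              for r in range(n)]
inv = inverse(I_minus_A2)

Ah = matmul(inv, A1)                                   # [I - A2]^{-1} A1
Ch = [c1 + x for c1, x in zip(C1[0], matmul(matmul(C2, inv), A1)[0])]
```

Only the first row needs the matrix inverse; interior points use the sparse A1 and A2 directly, so this cost is confined to the image boundary.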

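For the first column, we have

\[
\mathbf{Q}(-1,n_2) = \mathbf{Q}(0,n_2)
\tag{2.69}
\]

Using this assumption, we obtain

\[
\mathbf{Q}(0,n_2) = [\mathbf{I} - \mathbf{A}_1]^{-1}\,\mathbf{A}_2\,\mathbf{Q}(0,n_2-1) + [\mathbf{I} - \mathbf{A}_1]^{-1}\,\mathbf{B} f(0,n_2)
\tag{2.70}
\]

The corresponding output equation is given by

\[
g(0,n_2) = \left[\mathbf{C}_2 + \mathbf{C}_1\,[\mathbf{I} - \mathbf{A}_1]^{-1}\,\mathbf{A}_2\right]\mathbf{Q}(0,n_2-1)
+ \left[\mathbf{C}_1\,[\mathbf{I} - \mathbf{A}_1]^{-1}\,\mathbf{B} + \mathbf{D}\right] f(0,n_2)
\tag{2.71}
\]

Let

\[
\mathbf{A}_v = [\mathbf{I} - \mathbf{A}_1]^{-1}\,\mathbf{A}_2 ,
\tag{2.72}
\]

\[
\mathbf{B}_v = [\mathbf{I} - \mathbf{A}_1]^{-1}\,\mathbf{B} ,
\tag{2.73}
\]

\[
\mathbf{C}_v = \mathbf{C}_2 + \mathbf{C}_1\,[\mathbf{I} - \mathbf{A}_1]^{-1}\,\mathbf{A}_2 ,
\tag{2.74}
\]

and

\[
\mathbf{D}_v = \mathbf{C}_1\,[\mathbf{I} - \mathbf{A}_1]^{-1}\,\mathbf{B} + \mathbf{D}
\tag{2.75}
\]

Then, the state space representation for the first column can be written as

\[
\begin{aligned}
\mathbf{Q}(0,n_2) &= \mathbf{A}_v\,\mathbf{Q}(0,n_2-1) + \mathbf{B}_v f(0,n_2) \\
g(0,n_2) &= \mathbf{C}_v\,\mathbf{Q}(0,n_2-1) + \mathbf{D}_v f(0,n_2)
\end{aligned}
\tag{2.76}
\]

For our example, we have

\[
\mathbf{A}_v =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 &  0.21023129 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -0.23698230 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -0.07768622 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 &  0.08757145 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 &  0.04116659 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -0.04640486 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 &  0.36952738 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & -0.19581571 & 0
\end{bmatrix}
\tag{2.77}
\]

\[
\mathbf{B}_v =
\begin{bmatrix}
0.16390006 \\ 0.00222197 \\ 0.24277204 \\ 0.10029146 \\ 0.13504272 \\ 0.03475127 \\ 0.40445013 \\ 0.13726469
\end{bmatrix}
\tag{2.78}
\]

\[
\mathbf{C}_v = [\,0\;\; 0\;\; 0\;\; 0\;\; 0\;\; 0\;\; 1.3695274\;\; 1\,]
\tag{2.79}
\]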


Chapter  3 


The Block Data Flow Architecture.


3.1  Introduction. 


Two-dimensional digital filtering is one of the important applications of 2-D DLSI systems.
We need a multiprocessor system to implement 2-D digital filters in real-time at rates
appropriate for image display. We began our research on the real-time implementation of 2-D
digital filters by exploring the design of a special-purpose multiprocessor system for this
purpose. We later explored the potential for increasing the programmability of our design to
solve other problems. Thus, we derived the BDFA. We are currently exploring the mapping
of other problems to the BDFA.


3.2  The  BDFA  Configuration 


A  BDFA  system  consists  of  three  modules:  an  Input  Control  Module  (ICM),  a  Processor 
Array  (PA),  and  an  Output  Control  Module  (OCM)  as  shown  in  Figure  3.1. 


3.2.1  Input  Control  Module 

The  ICM  serves  as  a  buffer  between  the  host  system  (or  an  input/output  device)  and  the 
processor  array.  It  includes  two  FIFO  buffers  and  it  converts  the  input  data  stream  into 
blocks  of  data.  It  maintains  a  direct  input  channel  to  each  processor.  Designated  data 
blocks  are  sent  to  each  processor  through  these  channels  without  any  interference  from  other 
processors. A control logic submodule provides each processor with control for data
management and communication services. The block diagram of the ICM is shown in Figure 3.2.

Figure 3.1: Block Diagram of the BDFA

3.2.2  The  Processor  Array 


The  PA  contains  enough  processors  to  provide  the  computational  power  required  for  real-time 
signal processing and fast matrix operations. Since we limit the interprocessor
communications to being local and in one direction, we can simply connect all the processors
together to form a linear array. Each processor has a separate input channel and a separate
output channel. The processors are divided into two processor groups: an odd-numbered
processor group and an even-numbered processor group. Each processor group is directly
connected to an input FIFO buffer and an output FIFO buffer. Therefore, each processor always
uses the same input and output FIFO. Finally, FIFO buffers are used for interprocessor
communications to minimize overhead due to addressing and routing. The block diagram of the
processor array is shown in Figure 3.3.


3.2.3  The Output Control Module

The OCM consists of a control logic submodule, a submodule for post-processing, and two
output FIFO buffers. It collects processing results from each processing element and converts
the blocks of data into a synchronized output data stream. It provides each processor with
data management and communication services. The post-processing submodule may also
implement different dynamic scaling algorithms for signal processing. It collects overflow
information from each processor and adjusts the system gain based on this information. For
example, the system gain factor may be fed back to the processor array at the end of each
frame. A scale memory can also be used as a “look-up table” to scale the output for a
particular output device. The post-processing submodule is very flexible and can contain
different function modules for specific applications. The block diagram of an OCM is shown
in Figure 3.4.


Figure  3.4:  Output  Control  Module 


3.3  Architectural Features of a BDFA

The architectural features of a BDFA are:


•  block  data  processing  and  the  block  data  flow  paradigm, 

•  globally  asynchronous  and  locally  synchronous  data  transmission  protocol, 

•  linear  array  topology  and  “skew”  operations  among  processors, 

•  local  data  transmission  in  only  one  direction,  and 

•  overlap  of  data  movement  and  interprocessor  communication  with  data  computations. 

Large-scale tasks can be divided into smaller tasks using either an algorithm partitioning
strategy or a data partitioning strategy. With an algorithm partitioning strategy, a
complex algorithm is decomposed into a sequence of simple operations. Each simple operation
or group of operations is assigned to a different processor. With the data partitioning
strategy, the whole image or matrix is divided into data blocks and each data block is
assigned to a different processor. Each processor is capable of performing all the required
functions for the assigned data blocks. The algorithm partitioning strategy can simplify the
structure of each processing element. However, the processors in an algorithm-partitioned
system cannot operate independently and they are subject to timing, sequencing, and data
dependency restrictions.

In the BDFA, we adopted the data partitioning strategy at the high level to build
an alternative structure with independent processors. The more independent the processors
are, the less time is required to implement data communications protocols or to synchronize
data movements. A structure with independent processors is also more flexible in coping
with a variety of algorithms with different operational requirements.
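
The data partitioning strategy can be illustrated with a short sketch that splits an image into blocks of rows and assigns each block to a processor. The function name `partition_rows`, the row-block shape, and the round-robin assignment are assumptions made for illustration; the report does not specify the block shape or the assignment order here.

```python
def partition_rows(num_rows, block_rows, num_procs):
    """Split an image into row blocks and assign them to processors.

    Returns a list of (block_index, first_row, last_row_exclusive, processor).
    Round-robin assignment is an assumption of this sketch.
    """
    assignments = []
    for block, first in enumerate(range(0, num_rows, block_rows)):
        last = min(first + block_rows, num_rows)   # final block may be short
        processor = block % num_procs
        assignments.append((block, first, last, processor))
    return assignments


# Hypothetical case: a 128-row image, 16-row blocks, four processors.
blocks = partition_rows(128, 16, 4)
for block, first, last, processor in blocks:
    print(f"block {block}: rows {first}..{last - 1} -> processor {processor}")
```

Every processor runs the same program on its own blocks, which is why no processor depends on another for anything except boundary data.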

The second advantage of the data partitioning strategy is the reduction of unnecessary
data movement between the processors. Input data goes directly to the processor that
will use it. The interprocessor communications are limited to passing necessary
intermediate computational results. Output results go directly from each processor to an
output device without any interference to or from other processors. Additionally, block data
processing provides the opportunity for intermediate computational results to be used locally.
A large reduction in interprocessor data communications can have a tremendous impact on
the hardware implementation.

3.3.1  Block Data Flow

The  BDFA  implements  the  block  data  flow  paradigm  to  achieve  maximum  parallelism  at 
the  processor  level.  Incontestably,  we  need  many  processing  elements  working  together  to 
increase  computational  power.  The  use  of  the  Von  Neumann  computation  model  restricts 
the  full  utilization  of  all  of  the  processing  elements.  The  management  of  and  the  contention 
for  the  globally  addressable  memory  necessary  with  the  Von  Neumann  structure  also  limits 
effective  parallelism. 

The  data  flow  model  is  different  from  the  Von  Neumann  computation  model.  Data 
flow  processors  are  stored-program  computers.  If  sufficient  resources  are  provided,  the  system 
can  exploit  all  concurrency  present  in  the  program.  This  approach  can  be  naturally  extended 
to  an  arbitrary  number  of  processors[14].  This  concept  also  reduces  the  data  dependency, 
control  dependency,  and  resource  dependency  among  processors.  However,  it  is  difficult  to 
manage  the  data  flow  model  for  a  multiprocessor  system[15].  We  implemented  the  data  flow 
paradigm  at  the  processor  array  level  with  a  large  data-block-grain  and  limited  our  array 
to  being  linear.  With  these  restrictions,  we  have  successfully  implemented  the  data  flow 
paradigm  for  the  BDFA. 

When a processor in a BDFA system has received its assigned data block and all
of its intermediate data, then the processor is able to perform its designated functions
independently. When data blocks and the necessary intermediate data are available for all
processors, then all processors are able to perform their designated functions on their own
data blocks independently.
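
The firing rule described above can be sketched as a simple timing model for a linear array. This is a hypothetical illustration, not the simulator used in this work: the function name `firing_times`, the arrival times, and the fixed `boundary_delay` (the number of cycles after a processor starts before its intermediate data reaches its downstream neighbour) are all assumptions.

```python
def firing_times(block_arrival, boundary_delay):
    """Earliest cycle at which each processor in a linear array can start.

    Processor k fires once (a) its own input block has arrived and
    (b) the intermediate data from its upstream neighbour k-1 is available.
    """
    starts = []
    for k, arrival in enumerate(block_arrival):
        ready = arrival
        if k > 0:
            # Data flows in one direction only: from processor k-1 to k.
            ready = max(ready, starts[k - 1] + boundary_delay)
        starts.append(ready)
    return starts


# Hypothetical numbers: blocks arrive every 4 cycles, boundary data takes 6.
arrivals = [0, 4, 8, 12]
starts = firing_times(arrivals, boundary_delay=6)
```

In this model the start times are staggered from one processor to the next, which corresponds to the skewed operation of the array discussed in Section 3.3.3.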

The use of the block data flow paradigm also helps us to reduce data storage
requirements. In a BDFA system, the input data blocks flow into the system and the output data
blocks flow out of the system. There is no need to store the whole frame of the image or all
the entries of a matrix in a BDFA system.


3.3.2  Data  Transmission  Protocol 


Data transmission protocols may be categorized into synchronous data transmission
protocols and asynchronous data transmission protocols. The synchronous data transmission
protocol is fast and simple and there is no handshaking overhead. However, the synchronous
data transmission protocol places a timing restriction on the system design. In particular,
this can be a problem for large-scale systems. The asynchronous data transmission protocol
does not have this timing restriction. However, there is a considerable amount of
overhead associated with the asynchronous data transmission protocol. The BDFA system uses
a globally asynchronous data transmission protocol with a large data grain and a locally
synchronous data transmission protocol with small data elements to:

•  avoid the globally synchronous transmission problem,

•  minimize overhead due to asynchronous handshaking signals,

•  reduce  communications  control  hardware,  and 

•  minimize  data  communications  overhead. 


3.3.3  Linear  Array  Topology  and  Skew-Operations 

Since the BDFA uses the data partitioning strategy, the interprocessor communications have
been limited to only passing intermediate computational results. We restrict the
interprocessor communications to be local and in only one direction to make the implementation
of the data flow paradigm feasible. This means we can simply connect the processors
together to form a linear array. The linear array topology is simple and has efficient channel
utilization[16]. Furthermore, the linear array topology allows us to skew the operations
among the processors[17]. Allowing the operations to be skewed among processors plays an
important role in balancing the system input/output bandwidth and the computational
intensity. It also helps to reduce the storage requirements of the interprocessor communication
buffers.


3.3.4  Data  Communications 

A BDFA system overlaps the input/output data movements and the interprocessor
communication with data computations. This is possible because the ICM takes care of routing the
input data block to the appropriate processor as soon as it is available and the OCM always
provides an output FIFO whenever a processor needs to output a block of processing results.
Therefore, the processors are able to devote almost 100% of their time to computations.
Consequently, the system achieves high throughput and high efficiency.

In  addition,  a  BDFA  system  has  all  of  the  advantageous  features  of  a  systolic  array  or 
wavefront  array.  This  includes  such  features  as  modularity,  regularity,  local  interconnection, 
highly  pipelined  multiprocessing,  highly  parallel  processing  at  the  array  level,  and  a  balance 
of  external  I/O  and  computational  intensity.  The  BDFA  system  is  also  able  to  use  a  systolic 
array  or  a  wavefront  array  for  its  processing  elements. 


3.3.5  The  BDFA  Mapping  Criteria 


We established three criteria for mapping algorithms to a BDFA. These three criteria are:

•  the algorithm must be data partitionable,

•  data  communications  must  be  local  and  in  only  one  direction,  and 

•  the  computational  load  must  be  balanced  among  the  processors. 


The requirement for the algorithm to be data partitionable lays the foundation for block data
processing. The requirement for local unidirectional interprocessor communication makes
it easy to implement the block data flow paradigm. Basically, any algorithm which conforms
to these two criteria can be implemented on a BDFA. The third criterion, the computational
load balance, is sometimes hard to achieve because of the variety of computational
requirements of different algorithms. However, this criterion only affects the system's hardware
efficiency. In some applications, the hardware efficiency is not very critical and the system
throughput is of most concern. Thus, if an algorithm meets the first two BDFA mapping
criteria but not the third one, it still can be implemented on a BDFA with high throughput.
In addition, these criteria are not very restrictive and many algorithms can meet them
or can be adapted to meet them.

We have been able to map the following algorithms onto a BDFA:


•  2-D digital IIR filter[8],

•  2-D digital FIR filter[8],

•  orthogonal  transformation  of  a  dense  matrix  using  Givens  rotations[7], 

•  updating  and  down-dating  for  the  least  square  problem  based  on  an  inverse  QR 
decomposition[7], 

•  lower-upper  (LU)  decomposition  of  a  dense  matrix[7],  and 

•  2-D  discrete  cosine  transform[8]. 


3.3.6  Performance  Evaluation 

The BDFA was developed as a part of our efforts to implement 2-D IIR filters in real-time
[12], [8], [18]. As a part of this effort, we designed a special-purpose node processor [18] and
we developed a multiprocessor system which uses this processor to implement 2-D IIR filters
in real-time [7]. We refer to the special-purpose node processor as a 2-D DSP. In this section,
we summarize our simulation results on the performance evaluation of this multiprocessor
system as an example of the expected performance of a BDFA system.

Table 3.1: Performance of Systems with a Different Number of Processors


  N    initialization    latency    wait    throughput
 10         3492           1418        4      0.9696
  6         2129           1418      505      0.5940
  2          765           1418     1028      0.1988

Table 3.1 and Table 3.2 give the functional level simulation results for the 2-D IIR
digital filter BDFA system. In these tables, “initialization” refers to the number of cycles
required to initialize the system and load filter coefficients, “latency” refers to the interval in
cycles between the time a processor receives its assigned data block and the time it begins
transferring its output to the OCM, and “wait” refers to the number of cycles between output
data blocks. We consider the system to be performing in real-time when there is always a
processor ready to receive an input block as soon as that block is ready.

Table 3.1 shows the performance of a second-order system with different numbers
of 2-D DSPs. The size of the sample image is 128 x 128 pixels. This table reveals that
the BDFA system can perform 2-D IIR digital filtering in real-time. The relative system
throughput (the ratio of output pixels to the system cycles needed) is 0.9696 for a
ten-processor system, 0.5940 for a six-processor system, and 0.1988 for a two-processor
system, that is, very close to 1, 0.6, and 0.2, respectively. The system throughput
of the ten-processor system is about 5 times as high as that of the two-processor system
and about 1.6 times as high as that of the six-processor system. This means that the system
throughput increases in proportion to the number of processors until real-time processing
is achieved. Thus, the BDFA system almost achieves a linear speed-up rate.
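
The speed-up claim can be checked directly from the throughput column of Table 3.1. The short calculation below is only an arithmetic illustration; the choice of the two-processor system as the reference is an assumption of the sketch.

```python
# Relative throughput values copied from Table 3.1
# (second-order filter, 128 x 128 image).
throughput = {2: 0.1988, 6: 0.5940, 10: 0.9696}

base = 2  # reference system used for the comparison (an assumption)
speedup = {n: t / throughput[base] for n, t in throughput.items()}
# Ideal (linear) speed-up is the ratio of the processor counts.
fraction_of_linear = {n: speedup[n] / (n / base) for n in throughput}

for n in sorted(throughput):
    print(f"N={n:2d}  speed-up={speedup[n]:.2f}  "
          f"fraction of linear={fraction_of_linear[n]:.3f}")
```

Relative to the two-processor system, the ten-processor system comes out at roughly 4.9 times the throughput against an ideal of 5, and the six-processor system at just under 3 against an ideal of 3.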


Table  3.2:  Ten-Processor  System’s  Performance  on  Images  with  Different  Sizes 


   size       initialization    latency    wait    throughput
 512 x 512         3492           5642       4       0.9922
 256 x 256         3492           2816       4       0.9846
 128 x 128         3492           1418       4       0.9696
  64 x 64          3492            714       4       0.9411
  16 x 16          3492             00       4       0.8000

Table 3.2 shows a ten-processor system processing images of different sizes. The
system achieves its maximum throughput when it processes the image with the largest
possible data block size. The system processes all the images in real-time. This indicates that
the number of processors needed for real-time processing is independent of the processed image
size.


Commercial  DSPs,  such  as  the  Motorola  DSP56000,  and  general-purpose  processors, 
such  as  the  Intel  80486,  can  be  used  as  processing  elements  in  a  BDFA  system.  The  system's 
throughput  will  increase  proportionally  with  the  number  of  processors  in  the  system  due  to 
the  characteristics  of  the  BDFA.  Table  3.3  shows  the  number  of  cycles  needed  to  compute 
the  output  of  a  pixel  element  using  different  processors  in  a  BDFA  system. 


Table  3.3:  The  Number  of  Cycles  for  Different  Processors 


 order    2-D DSP    DSP56000    80486
   2         10          36        273
   4         26         100        785
   8         82         324       2577

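
A rough way to read Table 3.3 together with Table 3.1 is sketched below. It rests on an idealized model that is an assumption of this sketch, not a result of the report: each of N processors needs a fixed number of cycles per output pixel, processor cycles and system cycles are taken to be the same, and communication overhead is ignored, so the relative throughput is about N divided by the cycles per pixel, capped at one. The function names are hypothetical.

```python
import math

# Cycles per output pixel, copied from Table 3.3 (keyed by filter order).
cycles_per_pixel = {
    2: {"2-D DSP": 10, "DSP56000": 36, "80486": 273},
    4: {"2-D DSP": 26, "DSP56000": 100, "80486": 785},
    8: {"2-D DSP": 82, "DSP56000": 324, "80486": 2577},
}


def estimated_throughput(num_procs, cycles):
    """Idealized pixels per system cycle for num_procs processors."""
    return min(1.0, num_procs / cycles)


def processors_for_real_time(cycles):
    """Smallest processor count whose idealized throughput reaches one."""
    return math.ceil(cycles)


# Compare the idealized model with the simulated values of Table 3.1
# (second-order filter on the 2-D DSP, 10 cycles per pixel).
table_3_1 = {2: 0.1988, 6: 0.5940, 10: 0.9696}
model_error = {n: abs(estimated_throughput(n, cycles_per_pixel[2]["2-D DSP"]) - t)
               for n, t in table_3_1.items()}
```

Under these assumptions the model stays within a few percent of the simulated throughput in Table 3.1, and it suggests that the processor count needed for real-time operation scales with the cycles per pixel of the chosen processing element.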

3.4  Conclusions 


High system throughput and high system efficiency are the key requirements for many
real-time signal processing and fast matrix operation applications. The BDFA provides an
alternative multiprocessor system architecture for high throughput and high efficiency.